soundness and completeness


Thought of Search: Planning with Language Models Through The Lens of Efficiency

Neural Information Processing Systems

Among the most important properties of algorithms investigated in computer science are soundness, completeness, and complexity. These properties, however, are rarely analyzed for the vast collection of recently proposed methods for planning with large language models. In this work, we alleviate this gap. We analyze these properties of using LLMs for planning and highlight that recent trends abandon both soundness and completeness, gaining only inefficiency in return. We propose a significantly more efficient approach that can, at the same time, maintain both soundness and completeness. We demonstrate this on four representative search problems, comparing to the LLM-based solutions from the literature that attempt to solve them. We show that by using LLMs to produce the code for the search components, we can solve the entire datasets with 100% accuracy with only a few calls to the LLM. In contrast, the compared approaches require hundreds of thousands of calls and achieve significantly lower accuracy. We argue for a responsible use of compute resources, urging the research community to investigate sound and complete LLM-based approaches that uphold efficiency.
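
To make "code for the search components" concrete, here is a minimal Python sketch of what LLM-produced components could look like for the 24 Game, one of the representative search problems in this line of work; the function names and the plain depth-first driver are illustrative assumptions, not the paper's actual code.

    from itertools import combinations

    def successors(state):
        """Combine any two remaining numbers with +, -, *, / (state is a tuple)."""
        succ = []
        for i, j in combinations(range(len(state)), 2):
            rest = [state[k] for k in range(len(state)) if k not in (i, j)]
            a, b = state[i], state[j]
            results = {a + b, a - b, b - a, a * b}
            if b != 0:
                results.add(a / b)
            if a != 0:
                results.add(b / a)
            succ.extend(tuple(rest + [r]) for r in results)
        return succ

    def is_goal(state):
        """Goal: a single remaining number equal to 24 (up to float error)."""
        return len(state) == 1 and abs(state[0] - 24) < 1e-6

    def dfs(state):
        """Plain depth-first search driven entirely by the two components above."""
        return is_goal(state) or any(dfs(s) for s in successors(state))

    print(dfs((3, 3, 8, 8)))  # True: 8 / (3 - 8 / 3) = 24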


VERINA: Benchmarking Verifiable Code Generation

Ye, Zhe, Yan, Zhengxu, He, Jingxuan, Kasriel, Timothe, Yang, Kaiyu, Song, Dawn

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly integrated into software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often focus only on individual components rather than providing a holistic evaluation framework covering all tasks. In this paper, we introduce Verina (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, achieves a 61.4% code correctness rate, 51.0% for specification soundness and completeness, and a mere 3.6% proof success rate (based on one trial per task). We hope Verina will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code on https://github.com/sunblaze-ucb/verina.
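
As a rough illustration of what jointly generating code, a specification, and an alignment proof looks like, here is a toy Lean 4 sketch (not a task from the Verina dataset); the names and the tactic script are assumptions, written for a recent toolchain that provides the omega tactic.

    -- Toy task for illustration only (not drawn from the Verina dataset).

    -- Implementation: maximum of two natural numbers.
    def myMax (a b : Nat) : Nat :=
      if a ≥ b then a else b

    -- Specification: the result bounds both inputs and equals one of them.
    def myMaxSpec (a b r : Nat) : Prop :=
      a ≤ r ∧ b ≤ r ∧ (r = a ∨ r = b)

    -- Proof of code-specification alignment.
    theorem myMax_meets_spec (a b : Nat) : myMaxSpec a b (myMax a b) := by
      unfold myMaxSpec myMax
      by_cases h : a ≥ b <;> simp [h] <;> omega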


Automating Thought of Search: A Journey Towards Soundness and Completeness

Cao, Daniel, Katz, Michael, Kokel, Harsha, Srinivas, Kavitha, Sohrabi, Shirin

arXiv.org Artificial Intelligence

Planning remains one of the last standing bastions for large language models (LLMs), which now turn their attention to search. Most of the literature uses the language models as world models to define the search space, forgoing soundness for the sake of flexibility. A recent work, Thought of Search (ToS), proposed defining the search space with code, having the language models produce that code. ToS requires a human in the loop, collaboratively producing a sound successor function and goal test. The result, however, is worth the effort: all the tested datasets were solved with 100% accuracy. At the same time, LLMs have demonstrated significant progress in code generation and refinement for complex reasoning tasks. In this work, we automate ToS (AutoToS), completely taking the human out of the loop of solving planning problems. AutoToS guides the language model step by step towards the generation of sound and complete search components, through feedback from both generic and domain-specific unit tests. We achieve 100% accuracy, with minimal feedback iterations, using LLMs of various sizes on all evaluated domains.
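
A rough picture of such a feedback loop is sketched below in Python; the test format, the failure messages, and the llm.generate_components call are hypothetical stand-ins for illustration, not AutoToS's actual unit tests or prompting interface.

    def run_unit_tests(successor_fn, goal_fn, tests):
        """Run generic and domain-specific checks on candidate search components."""
        failures = []
        for state, expected in tests["goal_cases"]:
            if goal_fn(state) != expected:
                failures.append(f"goal test wrong on {state!r}: expected {expected}")
        for state in tests["states"]:
            for succ in successor_fn(state):
                # Generic check: a successor must actually change the state.
                if succ == state:
                    failures.append(f"successor of {state!r} returned the state unchanged")
        return failures

    def auto_refine(llm, task_description, tests, max_iterations=5):
        """Ask the model for search components, then iterate on unit-test feedback."""
        prompt = f"Write successor and goal-test functions for: {task_description}"
        for _ in range(max_iterations):
            successor_fn, goal_fn = llm.generate_components(prompt)  # hypothetical API
            failures = run_unit_tests(successor_fn, goal_fn, tests)
            if not failures:
                return successor_fn, goal_fn  # sound enough to hand to a search algorithm
            prompt = "Your previous functions failed these tests:\n" + "\n".join(failures)
        raise RuntimeError("no sound components produced within the iteration budget")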


A Dual-Perspective Approach to Evaluating Feature Attribution Methods

Li, Yawei, Zhang, Yang, Kawaguchi, Kenji, Khakzar, Ashkan, Bischl, Bernd, Rezaei, Mina

arXiv.org Artificial Intelligence

Feature attribution methods attempt to explain neural network predictions by identifying relevant features. However, establishing a cohesive framework for assessing feature attribution remains a challenge. There are several views through which we can evaluate attributions. One principal lens is to observe the effect of perturbing attributed features on the model's behavior (i.e., faithfulness). While providing useful insights, existing faithfulness evaluations suffer from shortcomings that we reveal in this paper. In this work, we propose two new perspectives within the faithfulness paradigm that reveal intuitive properties: soundness and completeness. Soundness assesses the degree to which attributed features are truly predictive features, while completeness examines how well the resulting attribution reveals all the predictive features. The two perspectives are based on a firm mathematical foundation and provide quantitative metrics that are computable through efficient algorithms. We apply these metrics to mainstream attribution methods, offering a novel lens through which to analyze and compare feature attribution methods.
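
One informal way to build intuition for the two perspectives is a precision/recall analogy over sets of features; the Python snippet below follows that analogy and is only a simplified illustration, not the perturbation-based metrics defined in the paper.

    def soundness_like(attributed, predictive):
        """Fraction of attributed features that are truly predictive (precision-like)."""
        attributed, predictive = set(attributed), set(predictive)
        return len(attributed & predictive) / len(attributed) if attributed else 0.0

    def completeness_like(attributed, predictive):
        """Fraction of truly predictive features the attribution recovers (recall-like)."""
        attributed, predictive = set(attributed), set(predictive)
        return len(attributed & predictive) / len(predictive) if predictive else 1.0

    # An attribution flags features {0, 3, 7} while the model relies on {0, 3, 5, 9}:
    print(soundness_like({0, 3, 7}, {0, 3, 5, 9}))     # ~0.67: one attributed feature is not predictive
    print(completeness_like({0, 3, 7}, {0, 3, 5, 9}))  # 0.50: half of the predictive features are missed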


Learning Features and Abstract Actions for Computing Generalized Plans

Bonet, Blai, Francès, Guillem, Geffner, Hector

arXiv.org Artificial Intelligence

Generalized planning is concerned with the computation of plans that solve not one but multiple instances of a planning domain. Recently, it has been shown that generalized plans can be expressed as mappings of feature values into actions, and that they can often be computed with fully observable non-deterministic (FOND) planners. The actions in such plans, however, are not the concrete actions of the individual instances, which are not necessarily shared across instances, but abstract actions defined on a set of common features. The formulation assumes that the features and the abstract actions are given. In this work, we address this limitation by showing how to learn them automatically. The resulting account of generalized planning combines learning and planning in a novel way: a learner, based on a Max SAT formulation, yields the features and abstract actions from sampled state transitions, and a FOND planner uses this information, suitably transformed, to produce the general plans. Correctness guarantees are given, and experimental results on several domains are reported.
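
To illustrate what a mapping from feature values to abstract actions looks like in practice, here is a small hand-written Python sketch for a "deliver all balls" family of instances; the feature, the abstract action, and the policy are illustrative assumptions rather than output of the Max SAT learner described above.

    def generalized_plan(n_remaining):
        """Map the feature value (balls still to deliver) to an abstract action."""
        if n_remaining > 0:
            return "deliver-one-ball"  # abstract action whose effect decrements the feature
        return None                    # feature is 0: the goal is reached

    def execute(n_initial):
        """Execute the same generalized plan on an instance of any size."""
        n, trace = n_initial, []
        while (action := generalized_plan(n)) is not None:
            trace.append(action)
            n -= 1                     # abstract effect on the feature value
        return trace

    print(len(execute(3)))   # 3 abstract actions
    print(len(execute(10)))  # 10 abstract actions: one plan, many instances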


Semantic Attachments for Domain-Independent Planning Systems

Dornhege, Christian (University of Freiburg) | Eyerich, Patrick (University of Freiburg) | Keller, Thomas (University of Freiburg) | Trüg, Sebastian (University of Freiburg) | Brenner, Michael (University of Freiburg) | Nebel, Bernhard (University of Freiburg)

AAAI Conferences

Solving real-world problems using symbolic planning often requires a simplified formulation of the original problem, since certain subproblems either cannot be represented at all or can be represented only in a way that leads to inefficiency. For example, manipulation planning may appear as a subproblem in a robotic planning context, or a packing problem can be part of a logistics task. In this paper we propose an extension of PDDL for specifying semantic attachments. This allows the evaluation of grounded predicates as well as the change of fluents by externally specified functions. Furthermore, we describe a general schema for integrating semantic attachments into a forward-chaining planner and report on our experience of adding this extension to the planners FF and Temporal Fast Downward. Finally, we present some preliminary experiments using semantic attachments.
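
The integration schema can be pictured with a short Python sketch in which a forward-chaining expansion step calls externally specified functions to evaluate a precondition and to compute fluent updates; the function names and the toy state representation are illustrative assumptions, not the actual FF or Temporal Fast Downward interfaces.

    def reachable(state, robot, obj):
        """External condition checker; a real attachment might call a motion planner."""
        return abs(state["pos"][robot] - state["pos"][obj]) <= state["arm_reach"]

    def update_arm_pose(state, robot, obj):
        """External effect applicator: compute the resulting numeric fluent values."""
        new_state = {**state, "pos": dict(state["pos"])}
        new_state["pos"][robot] = new_state["pos"][obj]
        return new_state

    def apply_pick_up(state, robot, obj):
        """One forward-chaining expansion step with attached condition and effect."""
        if not reachable(state, robot, obj):            # semantic attachment as precondition
            return None                                 # action not applicable in this state
        new_state = update_arm_pose(state, robot, obj)  # semantic attachment as effect
        new_state["holding"] = obj                      # ordinary symbolic effect
        return new_state

    state = {"pos": {"robot": 0.0, "cup": 0.8}, "arm_reach": 1.0, "holding": None}
    print(apply_pick_up(state, "robot", "cup"))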